EMPIRICAL COMPARISON OF CRITERION REFERENCED MEASUREMENT MODELS¹


ABSTRACT. The Army needs information about how well an individual
can perform the tasks necessary for him to do his job. This information
is often gathered by means of a "criterion-referenced test," a test made
up of items directly related to the job of interest. The test results
can be used in two ways. The first way is to sort individuals into two
groups, one made up of those who can perform their job satisfactorily
and the other made up of those who do not meet minimal job requirements.

A second use of the test results is to estimate the "true" capability
of the examinees to do the task being tested. These two uses are clearly
related. If one can precisely estimate an individual's capability, then
forming the two groups is not a problem. On the other hand, it may be
possible to effectively form the two groups without getting good
estimates of "true" capability.

Several psychometric models are available for grouping the
individuals and/or for estimating "true" scores. For example, one may simply
calculate the proportion of items correctly answered and use that
proportion as an estimate of "true" capability. Alternatively, a binomial
error model for deriving the expression for the regression of "true" score
on observed score can be used and a "true" score calculated for each
individual. Other possible models include a Bayesian Model II approach
and a latent trait model such as the Rasch one parameter logistic model.
Each of these models yields a somewhat different estimate of "true"
capability for any given individual. It follows that the makeup of the
job ability groups will vary from model to model. The purpose of this
research is to empirically study the models referred to above. What
is needed is an appropriate statistic (or statistics) and research
design for comparing each model against all others given the same test
data.


1. INTRODUCTION. The purpose of this paper is to elaborate on
some technical details and to highlight specific statistical and
research problems introduced in a previous paper by one of the authors
(Epstein, 1975).

Epstein described four procedures for estimating true scores from
observed scores. The first uses the observed proportion correct as an
estimate of the true proportion correct. This procedure is
straightforward and familiar. Hence, discussion of it will be reserved until
the problem of comparing the models is developed. The other three
procedures are 1) a binomial error model, 2) a Bayesian model, and 3) the
Rasch logistic model. Each will be discussed in detail.

¹ Reprinted from the Proceedings of the Twenty-First Conference on the
Design of Experiments in Army Research Development and Testing,
sponsored by The Army Mathematics Steering Committee on behalf of the
Chief of Research, Development and Acquisition, 22-24 October 1975.


2. BINOMIAL ERROR MODEL. The binomial error model (Lord and
Novick, 1968, pp. 508-529) is based on the assumption that the
conditional distribution of observed score for given proportion correct true
score (T) is the binomial distribution.

$$h(x \mid T) = \binom{n}{x} T^x (1-T)^{n-x},$$

where $x = 0, 1, \ldots, n$ is the number of correct responses observed and
$n$ is the total number of items on the test.

It is assumed that items are scored dichotomously, that total score
for an examinee is the number of items answered correctly, that items
are locally independent, and that items are equally difficult for a
given examinee.

The  relationship  between  the  observed  score  distribution  and  the 
underlying  true  score  distribution  can  be  written  as  follows: 

$$\phi(x) = \binom{n}{x} \int_0^1 g(T)\, T^x (1-T)^{n-x}\, dT, \qquad x = 0, 1, \ldots, n,$$

where $\phi(x)$ is the distribution of observed scores and $g(T)$ is the
unknown distribution of true scores.

It can be shown that if the regression of true score on observed
score is linear then the distribution of observed score, symbolized $h(x)$
to distinguish this special case from the general case $\phi(x)$, is
negative hypergeometric.

$$h(x) = \frac{b^{[n]}}{(a+b)^{[n]}} \, \frac{(-n)_x \, (a)_x}{(-b)_x \, x!}, \qquad x = 0, 1, \ldots, n,$$

where

a and b are parameters to be determined and
$n^{[x]} = n(n-1) \cdots (n-x+1)$,
$(a)_x = a(a+1) \cdots (a+x-1)$, $n^{[0]} = (a)_0 = 1$.

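The parameters, a and b, can be expressed in terms of moments of the
observed score distribution:

$$a = \left(-1 + \frac{1}{\alpha_{21}}\right) \mu_x$$

$$b = -a - 1 + \frac{n}{\alpha_{21}}$$

where $\mu_x$ is the mean observed score and $\alpha_{21}$ is the reliability
coefficient.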
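The fit check described here can be sketched in code. The sketch below computes a and b from the moment expressions and evaluates the negative hypergeometric probabilities. It assumes that $\alpha_{21}$ is the Kuder-Richardson formula 21 coefficient computed from the test length, mean, and population variance of the observed scores; the function names and the numerical values are hypothetical.

```python
from math import factorial

def kr21(n, mean, var):
    """KR-21 reliability (assumed meaning of alpha_21) from score moments."""
    return (n / (n - 1)) * (1 - mean * (n - mean) / (n * var))

def nh_params(n, mean, alpha21):
    """Parameters a and b from the moment expressions in the text."""
    a = (-1 + 1 / alpha21) * mean
    b = -a - 1 + n / alpha21
    return a, b

def rising(a, x):
    """Ascending factorial (a)_x = a(a+1)...(a+x-1), with (a)_0 = 1."""
    out = 1.0
    for i in range(x):
        out *= a + i
    return out

def falling(b, n):
    """Descending factorial b^[n] = b(b-1)...(b-n+1), with b^[0] = 1."""
    out = 1.0
    for i in range(n):
        out *= b - i
    return out

def nh_pmf(x, n, a, b):
    """Negative hypergeometric probability of observed score x."""
    lead = falling(b, n) / falling(a + b, n)
    return lead * rising(-n, x) * rising(a, x) / (rising(-b, x) * factorial(x))

def nh_distribution(n, a, b):
    """Probabilities for x = 0, 1, ..., n."""
    return [nh_pmf(x, n, a, b) for x in range(n + 1)]

# Hypothetical 10-item test with observed mean 4 and variance 6.
n, mean, var = 10, 4.0, 6.0
alpha21 = kr21(n, mean, var)
a, b = nh_params(n, mean, alpha21)
probs = nh_distribution(n, a, b)
```

Multiplying `probs` by the number of examinees gives expected score frequencies, which could then be compared with the observed frequencies using a goodness-of-fit statistic of the analyst's choice.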


The  discussion  thus  far  has  outlined  an  internal  check  of  the 
appropriateness  of  this  model  for  any  given  data  set.  That  is,  if 
one  can  show  adequate  fit  to  the  negative  hypergeometric  distribution 
by  the  observed  scores  then  it  is  reasonable  to  continue  with  this 
model  assuming  linear  regression.  If  adequate  fit  is  not  obtained 
then  either  the  more  general  nonlinear  regression  approach  must  be  used 
or  alternative  models  must  be  identified. 

It  can  be  shown  that  if  the  observed  score  distribution  is  negative 
hypergeometric,  the  true  score  distribution  is  either  the  two  parameter 
beta  distribution,  or  some  other  distribution  having  identical  moments 
up  through  order  n.  In  either  case,  the  regression  of  true  score  on 
observed  score  is  given  by  the  linear  equation 

$$E(T \mid x) = \alpha_{21} \frac{x}{n} + (1 - \alpha_{21}) \frac{\mu_x}{n}, \qquad x = 0, 1, \ldots, n.$$
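A short numerical illustration of the regression equation may help. It shows how each observed proportion is pulled toward the group mean proportion, with the amount of shrinkage set by $\alpha_{21}$. The function names and input values are hypothetical.

```python
def regressed_true_score(x, n, mean, alpha21):
    """Linear regression estimate E(T | x) of the true proportion correct."""
    return alpha21 * x / n + (1 - alpha21) * mean / n

def regression_table(n, mean, alpha21):
    """Estimated true proportion correct for every possible score 0..n."""
    return [regressed_true_score(x, n, mean, alpha21) for x in range(n + 1)]

# Hypothetical 10-item test, mean score 4, reliability 2/3.
estimates = regression_table(10, 4.0, 2.0 / 3.0)
```

With these inputs a perfect score of 10 maps to an estimate of 0.8 rather than 1.0, and a score equal to the mean is left unchanged at 0.4.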

3. BAYESIAN MODEL. The Bayesian model used to evaluate these data
is described by Lewis, Wang, and Novick (1973). The procedure transforms
the binomial test score data via an arc sine transformation. The
resulting score is assumed to be a sample from a normal population with its
mean value at the individual's transformed true ability. Distributions
for the prior mean and variance of the examinee group's transformed
scores are specified and posterior values calculated. Finally, the
posterior marginal distributions for the transformed scores are obtained
and estimates of individual true abilities on the original (proportion
correct) scale are calculated. The mathematical details are outlined
below.

The Freeman-Tukey transformation for binomial data is used in
this procedure:

$$g_j = \frac{1}{2} \left[ \sin^{-1} \sqrt{\frac{x_j}{n+1}} + \sin^{-1} \sqrt{\frac{x_j + 1}{n+1}} \right],$$

where $x_j$ is the number of correct responses for examinee j. The $g_j$
are assumed to be normally distributed with mean
$\gamma_j = \sin^{-1} \sqrt{\pi_j}$ and variance $v = (4n+2)^{-1}$, where $\gamma_j$
is the transformed value of the true proportion of correct responses, $\pi_j$.
The validity of the assumption of normality and the suitability of the
transformation for the procedures to follow can be shown to be adequate
for examinee groups of at least 15 persons and for tests at least 8 items
long.
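The transformation step can be sketched in a few lines. This assumes the usual averaged arc sine form of the Freeman-Tukey transformation, g = ½[arcsin √(x/(n+1)) + arcsin √((x+1)/(n+1))], together with the sampling variance (4n+2)⁻¹ stated in the text. The function names are hypothetical.

```python
from math import asin, sqrt

def freeman_tukey(x, n):
    """Averaged arc sine transform of x correct responses out of n items."""
    return 0.5 * (asin(sqrt(x / (n + 1))) + asin(sqrt((x + 1) / (n + 1))))

def sampling_variance(n):
    """Approximate variance of the transformed score for an n-item test."""
    return 1 / (4 * n + 2)

def transform_scores(scores, n):
    """Apply the transformation to a list of number-correct scores."""
    return [freeman_tukey(x, n) for x in scores]

# Hypothetical scores on a 10-item test.
g = transform_scores([0, 5, 10], 10)
v = sampling_variance(10)
```

The transformed values are in radians and increase with the number correct; a score of exactly half the items maps to π/4.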


The set of transformed variables, $\gamma_j$, is assumed to be a random
sample from a normal distribution with mean $\mu_\Gamma$ and variance
$\phi_\Gamma$. $\mu_\Gamma$ and $\phi_\Gamma$ are further assumed to be
independent and to have a uniform and inverse chi-square distribution
respectively. Explicit expressions for the prior and posterior density
functions are given in the Lewis, et al. paper.

The desired result of an analysis of this kind is the marginal
posterior density function for $\gamma_j$. Unfortunately, an explicit
expression for it is not obtainable from the joint posterior probability
density function of the $\gamma_j$ vector given the $g_j$ vector. Lewis, et al.
show methods for obtaining the marginal means and variances for the
$\gamma_j$ using numerical integration. However, they indicate that for
large sample sizes, the conditional posterior distribution of $\gamma_j$ given
$\phi_\Gamma$ and the $g_j$ vector provides an acceptable approximation. The
conditional approximation was used for the analysis of the data reported
in the Epstein paper.

The conditional distribution of $\gamma_j$ given $\tilde{\phi}_\Gamma$ and the
$g_j$ vector can be shown to be normal with mean

$$E(\gamma_j \mid \tilde{\phi}_\Gamma, g) = \frac{\tilde{\phi}_\Gamma\, g_j + v\, g_\cdot}{\tilde{\phi}_\Gamma + v}$$

and variance

$$\mathrm{var}(\gamma_j \mid \tilde{\phi}_\Gamma, g) = \frac{v\,(\tilde{\phi}_\Gamma + m^{-1} v)}{\tilde{\phi}_\Gamma + v},$$

where

$j = 1, 2, \ldots, m$,
$m$ = the number of examinees,
$g$ = the vector of transformed scores,
$g_\cdot = m^{-1} \sum_j g_j$, and
$\tilde{\phi}_\Gamma$ = the mode of $\phi_\Gamma$ given $g$.

$\tilde{\phi}_\Gamma$ can be obtained by solving the following equation:

$$(m + \nu + 1)\,\tilde{\phi}_\Gamma^3 + \Big[(m + 2\nu + 3)\,v - \sum_j (g_j - g_\cdot)^2 - \lambda\Big]\,\tilde{\phi}_\Gamma^2 + \big[(\nu + 2)\,v^2 - 2\lambda v\big]\,\tilde{\phi}_\Gamma - \lambda v^2 = 0.$$

In the above equation, $\nu$ is the degrees of freedom for the prior
inverse chi-square distribution of $\phi_\Gamma$. Lewis, et al. recommend that
a value of eight be used for most practical applications. $\lambda$ is the
scale factor for the inverse chi-square distribution. It can be
calculated by using the formula

$$\lambda = \frac{\nu - 2}{4(t + 1)},$$

where t is interpreted as the number of test items that the prior
information is considered to be equivalent to.
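The conditional approximation can be sketched as follows. The cubic is solved for a positive root by bisection, which is a choice made here for simplicity and not necessarily the method used by Lewis, et al.; if the cubic had more than one positive root, this sketch would return only one of them. In the code, `v` is the sampling variance (4n+2)⁻¹, `nu` is the prior degrees of freedom, and `lam` is the prior scale factor. All function names and data values are hypothetical.

```python
def prior_scale(nu, t):
    """Scale factor lambda, following the formula as given in the text."""
    return (nu - 2) / (4 * (t + 1))

def mode_equation(phi, g, v, nu, lam):
    """Left-hand side of the cubic whose root is the posterior mode of phi."""
    m = len(g)
    g_bar = sum(g) / m
    ss = sum((gj - g_bar) ** 2 for gj in g)
    c3 = m + nu + 1
    c2 = (m + 2 * nu + 3) * v - ss - lam
    c1 = (nu + 2) * v ** 2 - 2 * lam * v
    c0 = -lam * v ** 2
    return ((c3 * phi + c2) * phi + c1) * phi + c0

def posterior_mode(g, v, nu, lam):
    """Bisection for a positive root; the cubic is negative at phi = 0."""
    lo, hi = 0.0, 1.0
    while mode_equation(hi, g, v, nu, lam) < 0:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if mode_equation(mid, g, v, nu, lam) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def conditional_posterior(g, v, phi):
    """Conditional posterior means and common variance of the gamma_j."""
    m = len(g)
    g_bar = sum(g) / m
    means = [(phi * gj + v * g_bar) / (phi + v) for gj in g]
    variance = v * (phi + v / m) / (phi + v)
    return means, variance

# Hypothetical transformed scores for 15 examinees on a 10-item test.
g = [0.55, 0.62, 0.70, 0.78, 0.81, 0.85, 0.90, 0.93,
     0.98, 1.02, 1.05, 1.10, 1.17, 1.22, 1.30]
v, nu = 1 / 42, 8
lam = prior_scale(nu, t=10)
phi = posterior_mode(g, v, nu, lam)
means, variance = conditional_posterior(g, v, phi)
```

Each posterior mean is a weighted average of the examinee's own transformed score and the group mean, so the estimates are less spread out than the raw transformed scores.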

Once the $\gamma_j$ have been calculated, the last step in the procedure
is to calculate the estimates for the true proportion correct. This
is accomplished by applying the following equation:

$$\pi_j = \left(1 + \frac{1}{2n}\right) \sin^2 \gamma_j - \frac{1}{4n}$$
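As a small worked example, the back-transformation from the arc sine scale to the proportion-correct scale can be written directly from the equation π = (1 + 1/(2n)) sin²γ − 1/(4n). The function name is hypothetical.

```python
from math import sin, pi

def back_transform(gamma, n):
    """Estimated true proportion correct from a transformed value gamma."""
    return (1 + 1 / (2 * n)) * sin(gamma) ** 2 - 1 / (4 * n)

# Hypothetical values for a 10-item test.
middle = back_transform(pi / 4, 10)   # transformed score at the midpoint
lowest = back_transform(0.0, 10)      # smallest possible transformed value
```

Because of the −1/(4n) term, values of γ very close to 0 or π/2 map slightly outside the interval from 0 to 1, so an application may need to decide how to treat such estimates.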

4. RASCH MODEL. The Rasch one parameter logistic model (Wright and
Panchapakesan, 1969) assumes that the observed response $a_{ni}$ of person
n to item i is governed by a binomial probability function of person
ability $Z_n$ and item easiness $E_i$. The probability of a correct response is

$$P(a_{ni} = 1) = \frac{Z_n E_i}{1 + Z_n E_i}.$$

The probability of a wrong response is:

$$P(a_{ni} = 0) = 1 - P(a_{ni} = 1) = \frac{1}{1 + Z_n E_i}.$$

These equations may be combined to yield

$$P(a_{ni}) = \frac{(Z_n E_i)^{a_{ni}}}{1 + Z_n E_i}.$$

If we let $b_n = \log Z_n$ and $d_i = \log E_i$,
then

$$P(a_{ni}) = \frac{\exp(a_{ni}(b_n + d_i))}{1 + \exp(b_n + d_i)}.$$
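The response function can be checked numerically. The sketch below evaluates the log form of the model and confirms that it agrees with the ability-times-easiness form; the function names and parameter values are hypothetical.

```python
from math import exp, log

def rasch_prob(a, b, d):
    """P(a) for response a (1 = correct, 0 = wrong), log ability b, log easiness d."""
    return exp(a * (b + d)) / (1 + exp(b + d))

def rasch_prob_product(a, z, e):
    """Same probability written with ability Z and easiness E."""
    return (z * e) ** a / (1 + z * e)

# Hypothetical person ability Z = 2 and item easiness E = 1.5.
z, e = 2.0, 1.5
p_correct = rasch_prob(1, log(z), log(e))
p_wrong = rasch_prob(0, log(z), log(e))
```

Here Z·E = 3, so the probability of a correct response is 3/4 and the probability of a wrong response is 1/4.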

The number of correct responses to a given set of items is the only
information needed to estimate person ability. All persons who get the
same score will be estimated to have the same ability. Hence, in terms
of score groups,

$$P(a_{ni}) = \frac{\exp(a_{ni}(b_j + d_i))}{1 + \exp(b_j + d_i)},$$

where j = score of person n, and all persons with a score j are
estimated to have the same probability governing their responses to item i.

The equations obtained when the condition of a maximum likelihood
is satisfied for the model described in the preceding equation are:

$$a_{+i} = \sum_{j=1}^{k-1} \frac{r_j \exp(b_j^* + d_i^*)}{1 + \exp(b_j^* + d_i^*)}, \qquad i = 1, 2, \ldots, k$$

$$j = \sum_{i=1}^{k} \frac{\exp(b_j^* + d_i^*)}{1 + \exp(b_j^* + d_i^*)}, \qquad j = 1, 2, \ldots, k-1$$

where

$a_{+i}$ = number of persons who get item i correct,
$j$ = the total test score; an ability estimate is obtained for each score,
$r_j$ = number of persons in score group j,
$b_j^*, d_i^*$ = estimates of $b_j$ and $d_i$, and
$k$ = number of items.

The method consists of computing $d_i^*$ and $b_j^*$ from the implicit equations
above. The equations are handled as two independent sets and solved
accordingly.
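One way to solve the two sets of implicit equations is to alternate between them, taking a Newton step for each item easiness with the abilities held fixed and then a Newton step for each score-group ability with the easinesses held fixed. The sketch below does this. The alternating scheme, the centering of the easiness estimates at zero (needed because only sums b + d are identified), the fixed number of sweeps, and the small data set are all assumptions made for illustration, not details taken from Wright and Panchapakesan.

```python
from math import exp

def p(b, d):
    """Probability of a correct response for log ability b and log easiness d."""
    return 1.0 / (1.0 + exp(-(b + d)))

def fit_rasch(a_plus, r, sweeps=500):
    """a_plus[i]: persons answering item i correctly; r[j]: persons with score j."""
    k = len(a_plus)
    b = {j: 0.0 for j in r}
    d = [0.0] * k
    for _ in range(sweeps):
        for i in range(k):                      # item equations, abilities fixed
            resid = a_plus[i] - sum(r[j] * p(b[j], d[i]) for j in r)
            info = sum(r[j] * p(b[j], d[i]) * (1 - p(b[j], d[i])) for j in r)
            d[i] += resid / info
        shift = sum(d) / k                      # center easiness at zero
        d = [di - shift for di in d]
        for j in r:                             # score equations, easiness fixed
            resid = j - sum(p(b[j], di) for di in d)
            info = sum(p(b[j], di) * (1 - p(b[j], di)) for di in d)
            b[j] += resid / info
    return b, d

def max_residual(a_plus, r, b, d):
    """Largest absolute violation of either set of likelihood equations."""
    item = max(abs(a_plus[i] - sum(r[j] * p(b[j], d[i]) for j in r))
               for i in range(len(d)))
    score = max(abs(j - sum(p(b[j], di) for di in d)) for j in r)
    return max(item, score)

# Hypothetical data: 4 items, 8 persons with scores between 1 and 3.
a_plus = [6, 6, 4, 1]
r = {1: 2, 2: 3, 3: 3}
b_hat, d_hat = fit_rasch(a_plus, r)
worst = max_residual(a_plus, r, b_hat, d_hat)
```

Persons with a score of 0 or k are left out because no finite ability estimate satisfies their score equation, which is why the score groups run from 1 to k−1.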

An approximation of a standard error for item estimates can be
obtained by assuming that the variance of the item estimate is due
primarily to the uncertainty in the item score $a_{+i}$. To a first
approximation this gives:

$$V(d_i^*) \approx \left(\frac{\partial d_i}{\partial a_{+i}}\right)^2 V(a_{+i})$$

which leads to:

$$V(d_i^*) \approx \frac{1}{\sum_j r_j \exp(b_j^* + d_i^*) / (1 + \exp(b_j^* + d_i^*))^2}.$$
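The item variance approximation is simple to evaluate once estimates are in hand. The sketch below computes the reciprocal of the sum over score groups of r_j exp(b_j + d_i)/(1 + exp(b_j + d_i))². The function name and the input values are hypothetical.

```python
from math import exp, sqrt

def item_variance(d_i, b, r):
    """Approximate variance of one item estimate; b and r are keyed by score j."""
    total = 0.0
    for j, r_j in r.items():
        e = exp(b[j] + d_i)
        total += r_j * e / (1 + e) ** 2
    return 1 / total

# Hypothetical case: 8 persons in three score groups, all estimates at zero.
r = {1: 2, 2: 3, 3: 3}
b = {1: 0.0, 2: 0.0, 3: 0.0}
variance = item_variance(0.0, b, r)
standard_error = sqrt(variance)
```

Each person contributes at most 0.25 to the sum, so the variance cannot fall below 4 divided by the number of persons.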


The major contribution to the error variance of the ability
estimate comes from the variance in scores produced by a given
individual. This part of the error variance depends upon the number of
items and their easiness range.

An approximation of the variance of the ability estimate b* is
given by

$$V(b^*) = \frac{1}{\exp(b^*)\, C(b^*)} + \frac{1}{C^2(b^*)} \sum_i V(d_i) \left\{ \frac{\exp(d_i)}{[1 + \exp(d_i + b^*)]^2} \right\}^2$$

where

$$C(b^*) = \sum_i \frac{\exp(d_i)}{[1 + \exp(b^* + d_i)]^2}$$

and $V(d_i)$ is the variance of the item calibration $d_i$.

The first term in the V(b*) equation is due to the
variance in the score, and the second term is due to the imprecision
of item calibration. The first term is always larger than the second.
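The two contributions to the ability variance can be computed separately to see their relative size. The sketch assumes the variance has the form V(b*) = 1/(exp(b*) C(b*)) + (1/C²(b*)) Σ V(d_i) {exp(d_i)/(1 + exp(d_i + b*))²}², with C(b*) = Σ exp(d_i)/(1 + exp(b* + d_i))², which is one reading of the equation in this section. The function name and inputs are hypothetical.

```python
from math import exp

def ability_variance(b, d, var_d):
    """Return (score part, calibration part) of the ability variance."""
    terms = [exp(di) / (1 + exp(b + di)) ** 2 for di in d]
    c = sum(terms)
    score_part = 1 / (exp(b) * c)
    calibration_part = sum(v * t ** 2 for v, t in zip(var_d, terms)) / c ** 2
    return score_part, calibration_part

# Hypothetical 4-item test with all easiness estimates at zero.
score_part, calibration_part = ability_variance(0.0, [0.0] * 4, [0.5] * 4)
total = score_part + calibration_part
```

With these inputs the score part is 1.0 and the calibration part is 0.125, consistent with the statement that the first term dominates.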

5. DISCUSSION OF THE PROBLEM. One characteristic of a useful model is
that it has a small error of measurement. That is, the distribution of
estimated scores for a given true score is closely clustered around the
true score. The extent of the measurement error that can be expected
with a given model is dependent on the variance of the estimated true
score. For example, in the proportion correct model, the variance of
the estimated true proportion correct is equal to p(1-p)/n. In this
case the variance of the estimate will decrease as the number of
observations increases. Thus it would seem that any level of precision could
be obtained by simply adding observations. Unfortunately, for the number
of items that are usually practical on a test, the level of precision
possible is not completely satisfactory. It would be useful to compare
the variance of the true score estimates obtained with the other models
to the proportion correct model.


Therefore the question of how to derive an expression for the
variance of the estimated true scores for the other models must be
addressed. An expression for the binomial error model has been derived.
Since the binomial error model results in a regression equation it seems
reasonable to base the derivation on the general form of the error of
estimation,

$$\sigma_E = \sigma_T \sqrt{1 - \rho_{XT}^2}.$$

The ratio of the variance of true scores to the variance of observed
scores equals the reliability coefficient,

$$\frac{\sigma_C^2}{\sigma_X^2} = \alpha_{21},$$

where $\sigma_C^2$ is the variance of the true number correct. Since the
true number correct equals the true proportion correct times the number
of items, $C = nT$, one may write $\sigma_C^2 = n^2 \sigma_T^2$.
Substituting, $\sigma_T^2 = \alpha_{21} \sigma_X^2 / n^2$. The reliability of a
test equals the square of the correlation between true and observed
scores, $\alpha_{21} = \rho_{XT}^2$. Hence, the variance of the estimated
true score can be written

$$\sigma_E^2 = \frac{\sigma_X^2 \, \alpha_{21} (1 - \alpha_{21})}{n^2}.$$

For the Bayesian and Rasch models expressions for the variances
of the estimated true scores were not derived. In the case of the
Bayesian model the output is in terms of the arc sine of the true
proportion correct. While the sampling distribution of the transformed
variable is known, the variance of the estimated true proportion correct
itself was not determined. A similar problem exists for the Rasch model.
The sampling distributions of the ability and item difficulty indices
are known, as is the explicit equation for calculating the proportion
correct from those values. But an expression for the variance of the
estimated true proportion correct has not been derived. In short, the
problems are:

(1) For the Bayesian model, given the variance of $\gamma_j$ and the equation

$$\pi_j = \left(1 + \frac{1}{2n}\right) \sin^2 \gamma_j - \frac{1}{4n},$$

what is the variance of $\pi_j$; and

(2) For the Rasch model, given the variances of b* and d* and the equation

$$P(\text{correct}) = \frac{\exp(b^* + d^*)}{1 + \exp(b^* + d^*)},$$

what is the variance of P?

As a result of the discussion during the session a solution to the
above mathematical problems seems to be available. It was pointed out
that methods exist for deriving standard errors of functions of random
variables. One promising approach outlined in Kendall and Stuart (1969,
p. 231) involves evaluating terms of a Taylor expansion. Using the
Kendall and Stuart procedure it should be possible to derive expressions
for the standard error of measurement for each of the models. This will
allow for formal comparison of the models without real or simulated data.
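A first-order version of the Taylor expansion approach can be sketched for both problems. The variance of a smooth function of an estimate is approximated by the squared derivative times the variance of the estimate. For the Rasch case the sketch additionally assumes that b* and d* are uncorrelated, which is an assumption made here and not a result from the paper. Function names and inputs are hypothetical, and higher-order terms are ignored.

```python
from math import sin, exp, pi

def var_pi_delta(gamma, var_gamma, n):
    """First-order variance of pi = (1 + 1/(2n)) sin^2(gamma) - 1/(4n)."""
    slope = (1 + 1 / (2 * n)) * sin(2 * gamma)   # derivative of pi w.r.t. gamma
    return slope ** 2 * var_gamma

def var_p_delta(b, d, var_b, var_d):
    """First-order variance of P = exp(b + d) / (1 + exp(b + d))."""
    p = 1 / (1 + exp(-(b + d)))
    slope = p * (1 - p)          # same derivative with respect to b and to d
    return slope ** 2 * (var_b + var_d)          # assumes zero covariance

# Hypothetical inputs: 10-item test, gamma at the midpoint, b + d = 0.
bayes_var = var_pi_delta(pi / 4, 1 / 42, 10)
rasch_var = var_p_delta(0.0, 0.0, 0.4, 0.1)
```

The first-order approximation degrades where the derivative is near zero, for example for γ close to 0 or π/2, so further terms of the expansion may be needed there.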

The discussion then considered whether it was possible to compare
the models by obtaining an estimate of "true score" and comparing it to
the "real" true score. The problem lies in obtaining an acceptable
true score. Three approaches were considered and are expected to
provide a basis for future research. The first is to base model
comparisons on Monte Carlo simulation studies. Monte Carlo studies provide
an unambiguous true score but suffer from their lack of generalizability
to practical applications. A second approach is to define true score
as the score obtained on an instrument consisting of a large number of
items. The models would then be used to estimate the true score using
a smaller and more realistic number of items. This approach is
empirical and more directly oriented to practical applications where
testing time and the number of items that may be included in an
instrument are limited. Although this approach suffers from the fact that
the defined true score is not error free, the amount of error is not
likely to be significant for practical purposes. The third approach
would investigate the possibility of applying Geisser's predictive
sample reuse method (Geisser, 1973) to the comparison of the models.
Geisser's method may provide a more formal empirical approach to
model comparison than the second approach discussed above; however,
it has not been determined whether or not it is applicable to this
research.
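The first approach, a Monte Carlo comparison, can be sketched for two of the models. True proportions are drawn from a beta distribution, item responses are simulated as independent binomial trials, and the mean squared error of the proportion correct estimate is compared with that of the linear regression estimate. The beta distribution, its parameters, the sample size, and the use of the Kuder-Richardson formula 21 coefficient for $\alpha_{21}$ are all assumptions chosen for illustration.

```python
import random

def simulate(m=2000, n=10, alpha=2.0, beta=3.0, seed=0):
    """Return (MSE of proportion correct, MSE of regression estimate)."""
    rng = random.Random(seed)
    true = [rng.betavariate(alpha, beta) for _ in range(m)]
    # Number correct on n equally difficult, independent items.
    obs = [sum(rng.random() < t for _ in range(n)) for t in true]
    mean = sum(obs) / m
    var = sum((x - mean) ** 2 for x in obs) / m
    # Reliability estimated from the simulated scores (KR-21, an assumption).
    alpha21 = (n / (n - 1)) * (1 - mean * (n - mean) / (n * var))
    raw = [x / n for x in obs]
    reg = [alpha21 * x / n + (1 - alpha21) * mean / n for x in obs]

    def mse(est):
        return sum((e - t) ** 2 for e, t in zip(est, true)) / m

    return mse(raw), mse(reg)

mse_raw, mse_reg = simulate()
```

Because the simulated data satisfy the binomial error model exactly, a result favoring the regression estimate says little about real test data, which is the generalizability concern raised above.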

Four models for estimating true scores were presented and
methods for comparing their outputs were discussed. Procedures for
comparing the statistical properties of the models are available and
relatively straightforward. Future research will be concerned with
establishing the empirical validity of the models and their
applicability to solving practical measurement problems.